# High-Resolution Image Understanding

## EuroVLM-9B-Preview
**Org:** utter-project · **License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, Multilingual · **Downloads:** 156 · **Likes:** 2

EuroVLM-9B-Preview is a multimodal vision-language model based on the long-context version of EuroLLM-9B, supporting multiple languages and visual tasks. It is currently a preview release.
## Janus-Pro-7B
**Org:** deepseek-ai · **License:** MIT · **Tags:** Text-to-Image, Transformers · **Downloads:** 139.64k · **Likes:** 3,355

Janus-Pro is an autoregressive framework that unifies multimodal understanding and generation. By decoupling the visual encoding paths while employing a single Transformer architecture, it resolves the conflicting roles a visual encoder plays in understanding versus generation.
## paligemma2-28b-pt-896
**Org:** google · **Tags:** Image-to-Text, Transformers · **Downloads:** 116 · **Likes:** 48

PaliGemma 2 is a vision-language model (VLM) launched by Google, combining the capabilities of the Gemma 2 language model and the SigLIP vision model, supporting image and text inputs to generate text outputs.
## paligemma2-28b-mix-448
**Org:** google · **Tags:** Image-to-Text, Transformers · **Downloads:** 198 · **Likes:** 26

PaliGemma 2 is a vision-language model based on Gemma 2, supporting image-plus-text input and text output, suitable for a variety of vision-language tasks.
## paligemma2-10b-pt-896
**Org:** google · **Tags:** Image-to-Text, Transformers · **Downloads:** 233 · **Likes:** 32

PaliGemma 2 is a vision-language model (VLM) launched by Google that integrates the capabilities of Gemma 2, supporting image and text input to generate text output.
## paligemma2-10b-pt-448
**Org:** google · **Tags:** Image-to-Text, Transformers · **Downloads:** 282 · **Likes:** 14

PaliGemma 2 is Google's upgraded vision-language model (VLM) that combines the capabilities of Gemma 2, supporting image and text input to generate text output.
## paligemma2-3b-pt-448
**Org:** google · **Tags:** Image-to-Text, Transformers · **Downloads:** 3,412 · **Likes:** 45

PaliGemma 2 is a vision-language model based on Gemma 2, supporting image and text input to generate text output, suitable for a variety of vision-language tasks.
## paligemma2-3b-ft-docci-448
**Org:** google · **Tags:** Image-to-Text, Transformers · **Downloads:** 8,765 · **Likes:** 12

PaliGemma 2 is an upgraded vision-language model released by Google, combining the capabilities of Gemma 2 and the SigLIP vision model, supporting multilingual vision-language tasks.
## Llama-3.1-8B-Dragonfly-v2
**Org:** togethercomputer · **Tags:** Image-to-Text, English · **Downloads:** 113 · **Likes:** 1

Dragonfly is an instruction-tuned multimodal vision-language model based on Llama 3.1, supporting joint understanding and generation of images and text.
## ConvLLaVA-JP-1.3b-1280
**Org:** toshi456 · **Tags:** Image-to-Text, Transformers, Japanese · **Downloads:** 31 · **Likes:** 1

ConvLLaVA-JP is a Japanese vision-language model that supports high-resolution input and can hold conversations about input images.
## cogvlm2-llama3-chat-19B-int4
**Org:** THUDM · **License:** Other · **Tags:** Image-to-Text, Transformers, English · **Downloads:** 467 · **Likes:** 28

CogVLM2 is a multimodal dialogue model based on Meta-Llama-3-8B-Instruct, supporting both Chinese and English, with an 8K context length and 1344×1344-resolution image processing.
## 360VL-70B
**Org:** qihoo360 · **License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, Multilingual · **Downloads:** 103 · **Likes:** 10

360VL is an open-source large multimodal model built on the Llama 3 language model, featuring strong image understanding and bilingual text support.
## cogvlm2-llama3-chinese-chat-19B
**Org:** THUDM · **License:** Other · **Tags:** Image-to-Text, Transformers, English · **Downloads:** 118 · **Likes:** 68

CogVLM2 is a large multimodal model built on Meta-Llama-3-8B-Instruct, supporting both Chinese and English, with strong image understanding and dialogue capabilities.
## cogvlm2-llama3-chat-19B
**Org:** THUDM · **License:** Other · **Tags:** Image-to-Text, Transformers, English · **Downloads:** 7,805 · **Likes:** 212

CogVLM2 is a large multimodal model built on Meta-Llama-3-8B-Instruct, supporting image understanding and dialogue tasks with an 8K context length and 1344×1344-resolution image processing.
## 360VL-8B
**Org:** qihoo360 · **License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, Multilingual · **Downloads:** 22 · **Likes:** 13

360VL is a multimodal model built on the Llama 3 language model, featuring strong image understanding and bilingual dialogue capabilities.
## paligemma-3b-pt-896
**Org:** google · **Tags:** Image-to-Text, Transformers · **Downloads:** 1,788 · **Likes:** 119

PaliGemma is a versatile, lightweight vision-language model (VLM) that supports image and text inputs and generates text outputs, with multilingual capabilities.
## paligemma-3b-ft-ocrvqa-448
**Org:** google · **Tags:** Image-to-Text, Transformers · **Downloads:** 365 · **Likes:** 6

PaliGemma is a versatile, lightweight vision-language model (VLM) developed by Google, built on the SigLIP vision model and the Gemma language model, supporting image and text inputs with text outputs.
## xgen-mm-phi3-mini-instruct-r-v1
**Org:** Salesforce · **Tags:** Image-to-Text, Transformers, English · **Downloads:** 804 · **Likes:** 186

xGen-MM is the latest series of foundational large multimodal models developed by Salesforce AI Research. Building on improvements to the BLIP series, it features strong image understanding and text generation capabilities.
## llava-llama-3-8b-v1_1-gguf
**Org:** xtuner · **Tags:** Image-to-Text · **Downloads:** 9,484 · **Likes:** 216

A multimodal model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image understanding and text generation.
## llava-llama-3-8b-v1_1-transformers
**Org:** xtuner · **Tags:** Image-to-Text · **Downloads:** 454.61k · **Likes:** 78

A LLaVA model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image-text-to-text tasks.
## Yi-VL-34B
**Org:** 01-ai · **License:** Apache-2.0 · **Tags:** Image-to-Text · **Downloads:** 150 · **Likes:** 263

Yi-VL-34B is an open-source multimodal model from the Yi series, capable of understanding image content and engaging in multi-turn conversations, with outstanding performance on the MMMU and CMMMU benchmarks.